Goto

Collaborating Authors

 state size


ebd9629fc3ae5e9f6611e2ee05a31cef-Supplemental.pdf

Neural Information Processing Systems

Dataset (1)consists ofvarious lines in the image at a discrete set of angles, and the classification task is to detect the angle of 14 the line. Some images from the test set ofclasses 80 and 100 are multiplied with apermutation matrix to randomly permute rows and columns.



Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Neural Information Processing Systems

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch.This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA).Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size.This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size.Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in ``finetuning pretrained Transformers to RNNs'' (T2R) settings, reducing the need for extensive training from scratch.Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.




Formulating Discrete Probability Flow Through Optimal Transport

Neural Information Processing Systems

Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not. In this paper, we aim to establish the fundamental theory for the probability flow of discrete diffusion models. Specifically, we first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and also present an equivalent evidence for discrete cases.



StateX: Enhancing RNN Recall via Post-training State Expansion

Shen, Xingyu, Chen, Yingfa, Thai, Zhen Leng, Han, Xu, Liu, Zhiyuan, Sun, Maosong

arXiv.org Artificial Intelligence

While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities. Recently, recurrent neural networks (RNNs) such as gated linear attention (GLA) (Y ang et al., 2024) and Mamba2 (Dao & Gu, 2024) have shown promising capabilities in language modeling. These architectures have constant per-token complexity, while the more popular Transformer architecture (V aswani et al., 2023) has per-token complexity that grows linearly with the context length.



Scaling Context Requires Rethinking Attention

Gelada, Carles, Buckman, Jacob, Zhang, Sean, Bach, Txus

arXiv.org Artificial Intelligence

We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention which reduce the cost-per-token of a transformer impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention shows that these models dominate both exponential attention and linear attention at long-context training.